LemmaGen: Multilingual Lemmatisation with Induced Ripple-Down Rules
نویسندگان
چکیده
Lemmatisation is the process of finding the normalised forms of words appearing in text. It is a useful preprocessing step for a number of language engineering and text mining tasks, and especially important for languages with rich inflectional morphology. This paper presents a new lemmatisation system, LemmaGen, which was trained to generate accurate and efficient lemmatisers for twelve different languages. Its evaluation on the corresponding lexicons shows that LemmaGen outperforms the lemmatisers generated by two alternative approaches, RDR and CST, both in terms of accuracy and efficiency. To our knowledge, LemmaGen is the most efficient publicly available lemmatiser trained on large lexicons of multiple languages, whose learning engine can be retrained to effectively generate lemmatisers of other languages.
منابع مشابه
Learning Ripple down Rules for Efficient Lemmatization
The paper presents a system, LemmaGen, for learning Ripple Down Rules specialized for automatic generation of lemmatizers. The system was applied to 14 different lexicons and produced efficient lemmatizers for the corresponding languages. Its evaluation on the 14 lexicons shows that LemmaGen considerably outperforms the lemmatizers generated by the original RDR learning algorithm, both in terms...
متن کاملA Rule based Approach to Word Lemmatization
Lemmatization is the process of finding the normalized form of a word. It is the same as looking for a transformation to apply on a word to get its normalized form. The approach presented in this paper focuses on word endings: what word suffix should be removed and/or added to get the normalized form. This paper compares the results of two word lemmatization algorithms, one based on if-then rul...
متن کاملA Lemmatization Web Service Based on Machine Learning Techniques
Lemmatization is the process of finding the normalized form of words from surface word-forms as they appear in the running text. It is a useful pre-processing step for any number of language engineering tasks, esp. important for languages with rich inflection morphology. This paper presents two approaches to automated word lemmatization, which both use machine learning techniques to learn parti...
متن کاملMultiple Classification Ripple Down Rules : Evaluation and Possibilities
Ripple Down Rules (RDR) is a knowledge acquisition method which constrains the interactions between the expert and a shell to acquire only correct knowledge. Although RDR works well, it is only suitable for the problem of providing a single classification for a set of data. Multiple Classification Ripple Down Rules (MCRDR) is an extension of RDR which allows multiple independent classifications...
متن کاملRipple Down Rules, a practical method of learning from code rewrites
Software developers sometimes rewrite large sections of program code. Reducing the number of rewrites would save valuable development time. Ripple Down Rules (RDR) has a proven knowledge acquisition track record. RDR looks to offer a simple to maintain method capturing knowledge gained through experience. RDR allows recommendations identified when a failure occurs, to be captured and reused. Th...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- J. UCS
دوره 16 شماره
صفحات -
تاریخ انتشار 2010